Generating Multimodal References
Authors
Abstract
This paper presents a new computational model for the generation of multimodal referring expressions, based on observations of human communication. The algorithm extends the graph-based algorithm proposed by Krahmer et al. (2003) and makes use of a so-called Flashlight Model for pointing, which accounts for various types of pointing gestures of different precisions. Based on a notion of effort, the algorithm produces referring expressions that combine language and pointing gestures. The algorithm is evaluated in two production experiments, which gather spontaneous data on controlled input. The output of the algorithm coincides to a large extent with the utterances of the participants. An important difference, however, is that the participants tend to produce overspecified referring expressions, whereas the algorithm generates minimal ones. We briefly discuss ways to generate overspecified multimodal references.

Introduction

Human-computer interaction (HCI) studies the interaction between human users and computers, which takes place at the user interface. Advances in HCI provide evidence that the use of multiple modalities, such as speech and gesture, for both input and output may result in systems that are more natural and efficient to use (Oviatt, 1999). Consequently, current research in HCI shows an increased interest in developing interfaces that closely mimic human-human communication, and the development of “virtual characters” or “embodied conversational agents” (ECAs) that are able to communicate both verbally and non-verbally about a concrete spatial domain clearly fits this interest (e.g. Kopp et al., 2003; Cassell et al., 2000). A subtask addressed in many systems is that of identifying a certain object in a visual context accessible to both user and system.
This can be done, for example, by an ECA that points to the object, possibly in combination with a linguistic referring expression (RE). With the design of ECAs, the question arises of how referring expressions that combine linguistic information and gestures should be generated automatically, but also of how such multimodal REs are produced by humans (Beun & Cremers, 1998; Byron, 2003). The generation of referring expressions (GRE) is a central task in Natural Language Generation (NLG), and various algorithms that automatically produce REs have been developed (recent examples include Van Deemter & Krahmer, 2007; Van Deemter, 2002, 2006; Gatt, 2006; Jordan & Walker, 2005; Gardent, 2002; Krahmer et al., 2003). Existing GRE algorithms generally assume that both speaker and addressee have access to the same information. In most cases this information is represented by a knowledge base that contains the objects present in the domain of conversation and their properties. A typical algorithm takes as input a single object (the target) and a set of objects (the distractors) from which the target object needs to be distinguished (borrowing terminology from Dale & Reiter, 1995). The task of a GRE algorithm is to determine which set of properties is needed to single out the target from the distractors. This is known as content determination for REs. On the basis of this set of properties, a distinguishing description in natural language can be generated: a description that applies to the target but not to any of the distractors. In general, there are multiple distinguishing descriptions for a given target. Consider, for instance, the chess configuration in Figure 1, with a circle around the target. This target can be described exclusively with linguistic features that express, say, that the target is a knight and that it is white. However, “the white knight” is not uniquely identifying, because there are two white knights on the board.
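To make the content determination task concrete, the following sketch implements a simple property-selection loop in the spirit of Dale and Reiter's (1995) Incremental Algorithm, which the paper's terminology of targets and distractors builds on. The toy chess domain, the attribute names, and the preference order are illustrative assumptions, not an encoding taken from the paper or its Figure 1.

```python
# Minimal sketch of content determination for referring expressions,
# following the general shape of Dale & Reiter's Incremental Algorithm.
# Domain encoding and preference order below are hypothetical.

def make_referring_expression(target, distractors, preferred_attributes):
    """Select properties that single out `target` from `distractors`.

    `target` maps attribute -> value; `distractors` is a list of such
    dicts; `preferred_attributes` fixes the order in which attributes
    are tried (e.g. type before colour before position).
    """
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        value = target.get(attr)
        # Keep an attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:  # target is uniquely identified
            break
    return description, remaining

# Hypothetical chess domain: the target is one of two white knights.
target = {"type": "knight", "colour": "white", "position": "E5"}
distractors = [
    {"type": "knight", "colour": "white", "position": "B3"},
    {"type": "knight", "colour": "black", "position": "G6"},
    {"type": "pawn",   "colour": "black", "position": "D6"},
]

desc, left = make_referring_expression(
    target, distractors, ["type", "colour", "position"])
# "the white knight" alone leaves one distractor (the knight on B3),
# so the position attribute is added as well:
# desc == {"type": "knight", "colour": "white", "position": "E5"}
```

The point of the example is exactly the situation in the text: the properties "knight" and "white" alone do not yield a distinguishing description, so further information (here a position, in the paper alternatively a pointing gesture) must be added before the distractor set is empty.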
Consequently, more or other information is needed to distinguish the target knight. For instance, the expression “the white knight” can be extended with several relational properties: “in row 5”, “at position E5”, “that is threatened by a black pawn”, etc. Alternatively, a multimodal referring expression may be used, consisting of a pointing gesture plus a linguistic description such as “this knight”. Arguably, such a multimodal RE would be easier to process than an overlong linguistic description, certainly when the addressee is a beginning chess player who is perhaps not yet familiar with the structure of the board or with the names of the